Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(performance): query improvements for trends (load less people) #23135

Merged
merged 91 commits into from
Jun 25, 2024

Conversation

aspicer
Copy link
Contributor

@aspicer aspicer commented Jun 20, 2024

Problem

MX trends actor queries are failing https://posthog.slack.com/archives/C0368RPHLQH/p1718782039244149

They are OOMing because we have too many persons and PDIs. MX has 50,000,000 PDI entries and 200,000,000 people, and joining events against them takes > 36GB of memory in Clickhouse.

Changes

This adds a new alias for the persons table, called "filterable_persons". This new table looks into the join ON expression, looking for an id IN clause. If it finds one, it brings that inside of the subselect, to prevent us from having to pull all people.

This behavior is similar to the WhereClauseExtractor, but the WhereClauseExtrator is general and conservative. filterable_persons assumes you know what you're doing in the query, and skips the checks. It would be much too hard to expand the WhereClauseExtractor to know that it could bring the entire subselect inside the where.

It achieves reuse by putting a use_query_cache setting with a 600 (HOGQL_INCREASED_MAX_EXECUTION_TIME) second query_cache_ttl on the cache value. This means that if the exact same source query is rerun for the next 10 minutes, the results will be the same, but we shouldn't be allowing people run queries that frequently so it shouldn't be an issue. If it becomes one, we can put some sort of nonce on the CTE to only allow caching in this query.

We also

  • Set optimize_aggregation_in_order for the subquery with grouping on person_distinct queries. This helps (a lot - 16GB to 1GB) with memory usage in a vacuum, but doesn't matter too much once you try to join against it.

Unrelated but included:

  • Lets keyboard shortcuts work if the shift key is reporting 'k' as a capital letter

Followups

  • More generally, we should expand the WhereClauseExtractor and make sure we're using it. This will help inside of event queries where we join against people and then filter on person properties.

How did you test this code?

Wrote tests. Looked at the SQL. Ran a bunch of actors queries with breakdowns and filters locally.

@aspicer aspicer changed the title feat(performance): query improvements for trends feat(performance): query improvements for trends (load less people) Jun 20, 2024
@posthog-bot
Copy link
Contributor

📸 UI snapshots have been updated

1 snapshot changes in total. 0 added, 1 modified, 0 deleted:

  • chromium: 0 added, 1 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

Copy link
Contributor

github-actions bot commented Jun 20, 2024

Size Change: 0 B

Total Size: 1.06 MB

ℹ️ View Unchanged
Filename Size
frontend/dist/toolbar.js 1.06 MB

compressed-size-action

@posthog-bot
Copy link
Contributor

📸 UI snapshots have been updated

2 snapshot changes in total. 0 added, 2 modified, 0 deleted:

  • chromium: 0 added, 2 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@aspicer aspicer marked this pull request as ready for review June 21, 2024 00:15
@aspicer aspicer requested review from a team and mariusandra June 21, 2024 00:19
posthog/hogql/constants.py Outdated Show resolved Hide resolved
Comment on lines 268 to 281
WHERE and(equals(person.team_id, 2), in(id,
(SELECT source.actor_id AS actor_id
FROM
(SELECT actor_id AS actor_id, count() AS event_count, groupUniqArray(100)(tuple(timestamp, uuid, `$session_id`, `$window_id`)) AS matching_events
FROM
(SELECT e.person_id AS actor_id, toTimeZone(e.timestamp, 'UTC') AS timestamp, e.uuid AS uuid, e.`$session_id` AS `$session_id`, e.`$window_id` AS `$window_id`
FROM events AS e
LEFT JOIN
(SELECT argMax(replaceRegexpAll(nullIf(nullIf(JSONExtractRaw(groups.group_properties, 'industry'), ''), 'null'), '^"|"$', ''), toTimeZone(groups._timestamp, 'UTC')) AS properties___industry, groups.group_type_index AS index, groups.group_key AS key
FROM groups
WHERE and(equals(groups.team_id, 2), ifNull(equals(index, 0), 0))
GROUP BY groups.group_type_index, groups.group_key) AS e__group_0 ON equals(e.`$group_0`, e__group_0.key)
WHERE and(equals(e.team_id, 2), equals(e.event, 'sign up'), greaterOrEquals(toTimeZone(e.timestamp, 'UTC'), toDateTime64('2020-01-02 00:00:00.000000', 6, 'UTC')), less(toTimeZone(e.timestamp, 'UTC'), toDateTime64('2020-01-03 00:00:00.000000', 6, 'UTC')), ifNull(equals(e__group_0.properties___industry, 'technology'), 0)))
GROUP BY actor_id SETTINGS use_query_cache=1, query_cache_ttl=600) AS source)))
Copy link
Member

@Twixes Twixes Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case of the optimization kicking in seems massive, but seems legit at a glance. I do wonder how fast this is, given the in()'s SELECT is triple-nested. 😅 Probably depends on whether the events time range is large (can be months for an aggregate query) or tiny (a single day). (This is not actionable, I'm just wondering)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#23135 (comment)

This sort of got buried but discussed the performance difference with Marius above. I think this will (almost) always make things faster, because if you have enough events to have the inner queries run slow, you will also have a lot of people for the inefficient join.

If this ends up slowing down small users, we can figure out a way to create a set of "large user optimizations" that maybe only get turned on once people hit some event limit.

This will be much faster once materialized CTEs get landed in Clickhouse.

@Twixes
Copy link
Member

Twixes commented Jun 25, 2024

I like this a lot! Especially without filterable_persons being separate. Though seems like some tests might need to be updated (they aren't running as long as there's a merge conflict).

@aspicer aspicer merged commit 635e08c into master Jun 25, 2024
91 checks passed
@aspicer aspicer deleted the aspicer/group_pdi branch June 25, 2024 22:08
@aspicer
Copy link
Contributor Author

aspicer commented Jun 25, 2024

image

This change dropped memory usage on actors queries from 34 GB to 1 GB for MX. 💪

Copy link

sentry-io bot commented Jun 26, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ SyntaxError: mismatched input '_matching_events' expecting posthog.tasks.tasks.process_query_task View Issue
  • ‼️ ValueError: Missing both funnelStep and funnelCustomSteps posthog.tasks.tasks.process_query_task View Issue
  • ‼️ SyntaxError: mismatched input '_matching_events' expecting posthog.tasks.tasks.process_query_task View Issue
  • ‼️ TypeError: dateutil.relativedelta.relativedelta() argument after ** must be a mapping, not NoneType posthog.tasks.tasks.process_query_task View Issue
  • ‼️ ValueError: Missing both funnelStep and funnelCustomSteps posthog.tasks.tasks.process_query_task View Issue

Did you find this useful? React with a 👍 or 👎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants